{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Youtube Video Summarization with Voice Cloning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "Our script performs several tasks:\n",
    "\n",
    "1. downloads and processes a YouTube video,\n",
    "2. transcribes the audio from the YouTube video,\n",
    "3. summarizes the transcription of the transcribed audio, and\n",
    "4. converts the summary to speech using the user's voice.\n",
    "\n",
    "The script leverages Covalent for executing these tasks, either locally or on a cloud platform like GCP.\n",
    "\n",
    "## Import Dependencies\n",
    "\n",
    "First, we import necessary Python libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import streamlit as st\n",
    "\n",
    "from pytube import YouTube\n",
    "from pydub import AudioSegment\n",
    "from transformers import pipeline\n",
    "from TTS.api import TTS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This section loads libraries for audio processing, machine learning models, and Covalent for workflow management.\n",
    "\n",
    "```bash\n",
    "covalent deploy up gcpbatch\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set Up Covalent and Dependencies\n",
    "Covalent simplifies cloud resource management. We define dependencies for each task and configure a Covalent executor for cloud execution. \n",
    "In the example below we utilize [Google Cloud Batch](https://cloud.google.com/batch) using our [gcp batch executor](https://docs.covalent.xyz/docs/user-documentation/api-reference/executors/gcp/). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "audio_deps = [\n",
    "    \"transformers==4.33.3\", \"pydub==0.25.1\",\n",
    "    \"torchaudio==2.1.0\", \"librosa==0.10.0\",\n",
    "    \"torch==2.1.0\"\n",
    "]\n",
    "text_deps = [\"transformers==4.33.3\", \"torch==2.1.0\"]\n",
    "tts_deps = audio_deps + [\"TTS==0.19.1\"]\n",
    "\n",
    "executor = ct.executor.GCPBatchExecutor(\n",
    "    container_image_uri=\"docker.io/filipbolt/covalent-gcp-0.229.0rc0\",\n",
    "    vcpus=4,\n",
    "    memory=8192,\n",
    "    time_limit=3000,\n",
    "    poll_freq=1,\n",
    "    retries=1\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, you may use Covalent Cloud to execute this workflow by doing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import covalent_cloud as cc\n",
    "\n",
    "cc_executor = cc.CloudExecutor(num_cpus=4, env=\"genai-env\", memory=8192, time_limit=3000)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Define Covalent Tasks\n",
    "Each step of our workflow is encapsulated in a Covalent task. Here's an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "@ct.electron\n",
    "def download_video(url):\n",
    "    yt = YouTube(url)\n",
    "    # download file\n",
    "    out_file = yt.streams.filter(\n",
    "        only_audio=True, file_extension=\"mp4\"\n",
    "    ).first().download(\".\")\n",
    "\n",
    "    # rename downloaded file\n",
    "    os.rename(out_file, \"audio.mp4\")\n",
    "    with open(\"audio.mp4\", \"rb\") as f:\n",
    "        file_content = f.read()\n",
    "    return file_content\n",
    "\n",
    "\n",
    "@ct.electron\n",
    "def load_audio(input_file_content):\n",
    "    input_path = os.path.join(os.getcwd(), \"file.mp4\")\n",
    "    # write to file\n",
    "    with open(input_path, \"wb\") as f:\n",
    "        f.write(input_file_content)\n",
    "\n",
    "    audio_content = AudioSegment.from_file(input_path, format=\"mp4\")\n",
    "    return audio_content\n",
    "\n",
    "\n",
    "@ct.electron(executor=executor, deps_pip=audio_deps)\n",
    "def transcribe_audio(audio_content):\n",
    "    # Export the audio as a WAV file\n",
    "    audio_content.export(\"audio_file.wav\", format=\"wav\")\n",
    "\n",
    "    pipe = pipeline(\n",
    "        task=\"automatic-speech-recognition\",\n",
    "        # model=\"openai/whisper-small\",\n",
    "        model=\"openai/whisper-large-v3\",\n",
    "        chunk_length_s=30, max_new_tokens=2048,\n",
    "    )\n",
    "    transcription = pipe(\"audio_file.wav\")\n",
    "    return transcription['text']\n",
    "\n",
    "\n",
    "@ct.electron(executor=executor, deps_pip=text_deps)\n",
    "def summarize_transcription(transcription):\n",
    "    summarizer = pipeline(\n",
    "        \"summarization\",\n",
    "        model=\"facebook/bart-large-cnn\",\n",
    "    )\n",
    "    summary = summarizer(\n",
    "        transcription, min_length=5, max_length=100,\n",
    "        do_sample=False, truncation=True\n",
    "    )[0][\"summary_text\"]\n",
    "    return summary\n",
    "\n",
    "\n",
    "@ct.electron(executor=executor, deps_pip=tts_deps)\n",
    "def text_to_speech_voice_clone(text, speaker_content, output_file):\n",
    "    with open(\"speaker.wav\", \"wb\") as f:\n",
    "        f.write(speaker_content)\n",
    "\n",
    "    # agree to service agreement programmatically\n",
    "    os.environ['COQUI_TOS_AGREED'] = \"1\"\n",
    "\n",
    "    tts = TTS(\"tts_models/multilingual/multi-dataset/xtts_v1\")\n",
    "    tts.tts_to_file(\n",
    "        text=text,\n",
    "        file_path=output_file,\n",
    "        speaker_wav=\"speaker.wav\",\n",
    "        language=\"en\"\n",
    "    )\n",
    "    with open(output_file, \"rb\") as f:\n",
    "        file_content = f.read()\n",
    "    return file_content\n",
    "\n",
    "\n",
    "@ct.electron\n",
    "def load_wav_file(wav_file):\n",
    "    with open(wav_file, \"rb\") as f:\n",
    "        file_content = f.read()\n",
    "    return file_content"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use the `@ct.electron` decorator to define tasks like `download_video`,`mp4_to_wav`, `transcribe_audio`, etc."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Orchestrate the Workflow\n",
    "The `@ct.lattice` decorator is used to define the workflow that orchestrates the entire process:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "@ct.lattice\n",
    "def workflow(url, user_voice_file, output_file):\n",
    "    video_content = download_video(url)\n",
    "    audio_content = load_audio(video_content)\n",
    "\n",
    "    user_voice_content = load_wav_file(user_voice_file)\n",
    "\n",
    "    # Use Google Cloud Batch to transcribe, summarize and re-voice\n",
    "    transcription = transcribe_audio(audio_content)\n",
    "    summary = summarize_transcription(transcription)\n",
    "    output_file_content = text_to_speech_voice_clone(\n",
    "        summary, user_voice_content, output_file\n",
    "    )\n",
    "    return summary, transcription, output_file_content"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Streamlit Interface\n",
    "We use Streamlit to create an interactive web interface for the script:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2024-01-08 14:37:20.421 \n",
      "  \u001b[33m\u001b[1mWarning:\u001b[0m to view this Streamlit app on a browser, run it with the following\n",
      "  command:\n",
      "\n",
      "    streamlit run /home/filip/miniconda3/envs/google/lib/python3.9/site-packages/ipykernel_launcher.py [ARGUMENTS]\n",
      "2024-01-08 14:37:20.422 Session state does not function when running a script without `streamlit run`\n"
     ]
    }
   ],
   "source": [
    "import streamlit as st\n",
    "\n",
    "# Function to display results\n",
    "def display_results(summary, transcription, audio_file_content):\n",
    "    display_summary(summary)\n",
    "    display_full_transcription(transcription)\n",
    "    display_audio_summary(audio_file_content)\n",
    "\n",
    "# Function to display the summary\n",
    "def display_summary(summary):\n",
    "    st.subheader(\"YouTube transcription summary:\")\n",
    "    st.text(summary)\n",
    "\n",
    "# Function to display the full transcription with a toggle\n",
    "def display_full_transcription(transcription):\n",
    "    st.subheader(\"YouTube full transcription\")\n",
    "    if st.checkbox(\"Show/Hide\", False):\n",
    "        st.text(transcription)\n",
    "\n",
    "# Function to display the audio summary\n",
    "def display_audio_summary(audio_file_content):\n",
    "    st.subheader(\"Summary in your own voice:\")\n",
    "    st.audio(audio_file_content, format=\"audio/wav\")\n",
    "\n",
    "\n",
    "# Streamlit app layout\n",
    "def main():\n",
    "    st.title(\"Summarize YouTube videos in your own voice using AI\")\n",
    "    speaker_file, speaker_file_path = upload_speaker_file()\n",
    "    youtube_url = st.text_input(\"Enter valid YouTube URL\")\n",
    "\n",
    "    if st.button(\"Process\"):\n",
    "        process_input(speaker_file, speaker_file_path, youtube_url)\n",
    "    elif \"transcription\" in st.session_state:\n",
    "        display_results(\n",
    "            st.session_state[\"summary\"],\n",
    "            st.session_state[\"transcription\"],\n",
    "            st.session_state[\"audio_file_content\"]\n",
    "        )\n",
    "\n",
    "# Function to upload speaker file\n",
    "def upload_speaker_file():\n",
    "    speaker_file = st.file_uploader(\"Upload an audio file (WAV)\", type=[\"wav\"])\n",
    "    if speaker_file:\n",
    "        st.audio(speaker_file, format=\"audio/wav\")\n",
    "        speaker_file_path = \"speaker.wav\"\n",
    "        with open(speaker_file_path, \"wb\") as f:\n",
    "            f.write(speaker_file.getbuffer())\n",
    "        return speaker_file, speaker_file_path\n",
    "    return None, None\n",
    "\n",
    "# Function to process the input\n",
    "def process_input(speaker_file, speaker_file_path, youtube_url):\n",
    "    if speaker_file and youtube_url:\n",
    "        audio_file_full_path = os.path.join(os.getcwd(), \"audio.wav\")\n",
    "        speaker_file_full_path = os.path.join(os.getcwd(), speaker_file_path)\n",
    "\n",
    "        dispatch_id = ct.dispatch(workflow)(\n",
    "            youtube_url, speaker_file_full_path, audio_file_full_path\n",
    "        )\n",
    "        with st.spinner(f\"Processing... job dispatch id: {dispatch_id}\"):\n",
    "            result = ct.get_result(dispatch_id, wait=True)\n",
    "\n",
    "        if result:\n",
    "            summary, transcription, output_file_content = result.result\n",
    "            st.session_state[\"transcription\"] = transcription\n",
    "            st.session_state[\"summary\"] = summary\n",
    "            st.session_state[\"audio_file_content\"] = output_file_content\n",
    "            display_results(summary, transcription, output_file_content)\n",
    "        else:\n",
    "            st.error(\"Something went wrong. Please try again.\")\n",
    "\n",
    "main()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Running the Script\n",
    "To run the script, execute it via Streamlit:\n",
    "\n",
    "```bash\n",
    "streamlit run your_script.py\n",
    "```\n",
    "\n",
    "In the Covalent UI, you should be seeing a workflow like the following\n",
    "\n",
    "![alt text](assets/workflow.gif)\n",
    "\n",
    "The streamlit app usage would then be:\n",
    "\n",
    "![alt text](assets/streamlit_gcp.gif)\n",
    "\n",
    "### Customizing the Workflow\n",
    "You can tailor this script to your specific needs:\n",
    "\n",
    "- Modify the Covalent task functions for different processing requirements like swapping one of the models.\n",
    "- Adjust the Covalent executor settings based on your cloud resource needs.\n",
    "\n",
    "### Conclusion\n",
    "\n",
    "This tutorial demonstrates using Covalent for fine-tuning a speech summarization model. Covalent's cloud computing abstraction simplifies executing complex workflows, making it a powerful tool for developers and researchers in AI/ML fields."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
